WHAT DID FISHER MEAN BY AN ESTIMATE?
Esa Uusipaikka
University of Turku

Fisher's Method of Maximum Likelihood is shown to be a procedure for the
construction of likelihood intervals or regions, instead of a procedure of
point estimation. Based on Fisher's articles and books it is justified that
by estimation Fisher meant the construction of likelihood intervals or
regions from appropriate likelihood function and that an estimate is a
statistic, that is, a function from a sample space to a parameter space such
that the likelihood function obtained from the sampling distribution of the
statistic at the observed value of the statistic is used to construct
likelihood intervals or regions. Thus Problem of Estimation is how to choose
the 'best' estimate. Fisher's solution for the problem of estimation is
Maximum Likelihood Estimate (MLE). Fisher's Theory of Statistical Estimation
is a chain of ideas used to justify MLE as the solution of the problem of
estimation.

The construction of confidence intervals by the delta method from the
asymptotic normal distribution of MLE is based on Fisher's ideas, but is
against his 'logic of statistical inference'. Instead the construction of
confidence intervals from the profile likelihood function of a given interest
function of the parameter vector is considered as a solution more in line
with Fisher's 'ideology'. A new method of calculation of profile
likelihood-based confidence intervals for general smooth interest functions
in general statistical models is considered.

1. Introduction. 'Collected Papers of R. A. Fisher' (Bennet 1971) contains
294 articles. In eight of these (Fisher 1922, 1925b, 1932, 1934, 1935b, 1936,
1938, 1951) Fisher considers and explicitly mentions the "problem of
estimation". In fact the title of the second of those articles is 'Theory of
statistical estimation'. Three of his books (Fisher 1925a, 1935a, 1956) all
have a final chapter in which Fisher considers statistical estimation. On
page 143 of his last book, 'Statistical Methods and Scientific Inference',
Fisher nevertheless writes that

A distinction without a difference has been introduced by certain writers who 
distinguish "Point estimation", meaning some process of arriving at an es- 
timate without regard to its precision, from "Interval estimation" in which 
precision of the estimate is to some extent taken into account. "Point esti- 
mation" in this sense has never been practised either by myself, or by my 
predecessor Karl Pearson, who did consider the problem of estimation in some 
of its aspects, or by his predecessor Gauss of nearly one hundred years earlier, 
who laid the foundations of the subject. 

In his famous R. A. Fisher Memorial Lecture (Savage 1976) in the year 
1970 L. J. Savage reacts to this by saying that 

By "estimation," Fisher normally means what is ordinarily called point es- 
timation. ... The term "point estimation" made Fisher nervous, because he 
associated it with estimation without regard for accuracy, which he regarded 
ridiculous and seemed to believe that some people advocated; . . . 

and 26 years later B. Efron says in his R. A. Fisher Memorial Lecture (Efron 
1998) that 

Fisher's great accomplishment was to provide an optimality standard for sta- 
tistical estimation - a yardstick of the best it's possible to do in any given 
estimation problem. Moreover, he provided a practical method, maximum 
likelihood, that quite reliably produces estimators coming close to the ideal 
optimum even in small samples. 

Even though Efron does not explicitly mention 'point estimation', later in
his talk he speaks about the maximum likelihood estimate and its estimated
standard error, that is, about the well-known shorthand expression

(1.1)    $\hat{\theta} \pm \mathrm{se}_{\hat{\theta}}$

(actually Efron considered approximate confidence intervals based on (1.1)).
Savage's reaction may be interpreted as saying that hardly anybody uses
point estimates without giving their estimated standard errors.

Fisher's comment on point estimation in the chapter 'The principles of
estimation' of his last book is puzzling, because in his own articles and
books he uses the notation (1.1) heavily, from his first article (Fisher
1912) onwards. The crucial word in that comment is 'precision'. On page 158
of the same chapter Fisher writes

The study of the sampling errors, that is, of the precision, of statistical esti- 
mates, . . . 

This quotation indicates that for Fisher the precision of an estimate meant
more than a plain standard error.

In Section 2 Fisher's theory of statistical estimation is discussed. First,
the concepts and notation used in this article are introduced. Then a possible
answer to the question in the title of this article, that is, to the question
about the meaning of an estimate, is given. This answer also shows that
Fisher interpreted (1.1) differently, and that difference explains his comment
on point estimation and the difference of views between Fisher and all those
who practice point estimation. Section 3 presents the prevailing view of
maximum likelihood estimation (MLE) and shows that it is based on Fisher's
ideas but is in conflict with his ideology of statistical inference. In
Section 4 the current state of Fisher's theory of estimation is reviewed. In
the next section the special case of a real-valued interest function is
considered in more detail.

In writing this article two important decisions have been made. Firstly,
because of the nature of the message of the article, many quotations from
Fisher's articles and books have been included. In connection with these,
additional citations to places which contain similar material are given.
Secondly, technical material, for example proofs of various propositions, has
not been included. These can be found in Fisher's publications and in the
large literature discussing Fisher personally, his influence in statistics, and 
publications based on Fisher's scientific ideas. Historical material on Fisher 
includes Stigler (1973, 2005), Edwards (1974, 1997a, 1997b), Box (1978), 
Zabell (1989, 1992), and Aldrich (1997). Barnard (1963), Savage (1976), Rao 
(1992), and Efron (1998) are reviews about Fisher's influence on statistics. 
Early articles promoting likelihood-based inference include Bartlett (1936), 
Barnard (1948, 1951), Barnard et al. (1962), Box and Cox (1964), and
Kalbfleisch and Sprott (1970). Hacking (1965), Edwards (1972), Cox and 
Hinkley (1974), Barndorff-Nielsen (1988), Barndorff-Nielsen and Cox (1989, 
1994), Lindsey (1996), Pace and Salvan (1997), Severini (2000), and Pawitan 
(2001) are books discussing the likelihood approach to statistical inference 
and contain further links to literature on Fisher. 

2. Fisher's theory of statistical estimation. 

2.1. Statistical evidence. 

Response and its statistical model. Statistical inference is based on
statistical evidence, which has at least two components. The two necessary
components are the observed response $y_{\mathrm{obs}}$ and its statistical
model $\mathcal{M}$.

Because by definition the response contains random variation, the actual
observed value of the response is one among a set of plausible values for
the response. The set of all possible values for the response is called the
sample space and denoted by $\mathcal{Y}$. The generic point $y$ in the sample
space is called the response vector.

The random variation contained in the response is described by either a
point probability function or a density function defining a probability
distribution in the sample space $\mathcal{Y}$. Because the purpose of
statistical inference is to give information on the unknown features of the
phenomenon under study, the probability distribution of the response is not
completely known. Thus it is assumed that there is a set of probability
distributions capturing the important features of the phenomenon, among which
the features of primary interest are crucial, but the so-called secondary ones
are also important. It is always possible to index the set of possible
probability distributions, and in statistical inference this index is called
the parameter. The set of all possible values for the parameter is called the
parameter space and denoted by $\Omega$. The generic point $\omega$ in the
parameter space is called the parameter vector.

In this article only so-called parametric statistical models are considered.
In parametric statistical models the value of the parameter is determined by
the values of finitely many real numbers, which form the finite dimensional
parameter vector. Thus the parameter space is a subset of some finite
dimensional space $\mathbb{R}^d$, where $d$ is a positive integer denoting the
dimension of the parameter space.

There is a probability distribution defined for every value of the parameter
vector $\omega$. Thus a statistical model is defined by a model function
$p(y; \omega)$, which is a function of the observation vector $y$ and the
parameter vector $\omega$ such that for every fixed value of the parameter
vector the function defines a distribution in the sample space $\mathcal{Y}$.
The statistical model consists of the sample space $\mathcal{Y}$, the parameter
space $\Omega$, and the model function $p(y; \omega)$. It is denoted by

(2.1)    $\mathcal{M} = \{p(y; \omega) : y \in \mathcal{Y}, \omega \in \Omega\}$

or by the following mathematically more correct notation

(2.2)    $\mathcal{M} = \{p(\cdot\,; \omega) : \omega \in \Omega\}$,

where $p(\cdot\,; \omega)$ denotes the point probability or density function on
the sample space $\mathcal{Y}$ defined by the parameter vector $\omega$.
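As a concrete illustration of a model function, the following Python sketch
writes down $p(y; \mu)$ for a sample of independent exponential responses with
mean $\mu$, so that the parameter space is $(0, \infty)$. The exponential
family is an assumption made here for illustration only, and the function names
are hypothetical. The helper checks numerically that, for a fixed parameter
value, the model function is a density on the sample space of a single
response.

```python
import math


def model_function(y, mu):
    """Joint density p(y; mu) of independent exponential responses with mean mu."""
    if mu <= 0:
        raise ValueError("mu must lie in the parameter space (0, inf)")
    if any(v < 0 for v in y):
        return 0.0  # outside the support of the exponential distribution
    n = len(y)
    return mu ** (-n) * math.exp(-sum(y) / mu)


def integrates_to_one(mu, upper=100.0, steps=100_000):
    """Midpoint-rule integral of p(.; mu) over [0, upper] for a single response."""
    h = upper / steps
    total = sum(model_function([(i + 0.5) * h], mu) for i in range(steps))
    return total * h
```

Fixing `mu` and varying `y` gives a distribution on the sample space, while
fixing `y` and varying `mu` gives the function of the parameter on which the
likelihood concepts are built.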

Statistical inference. Statistical inference concerns some characteristic or 
characteristics of the phenomenon from which the observations have arisen. 
The characteristics of the phenomenon under consideration are some func- 
tions of the parameter and are called the parameter functions of interest. If 
a subset of the components of the parameter vector form the interest func- 
tions, then the rest of the components are said to be nuisance parameters. 

The result of a statistical inference procedure is a collection of statements 
concerning the unknown values of the parameter functions of interest. The 
statements are based on statistical evidence. The important problem of the 
theory of statistical inference is to characterize the form of statements that 
a given statistical evidence supports. 

The uncertainties of the statements are also an essential part of the infer- 
ence. The statements and their uncertainties are both based on statistical 
evidence and together form the evidential meaning of the statistical evi- 
dence. 

Likelihood concepts. For the sake of generality let us denote by $A$ the
event describing what has been observed with respect to the response. Usually
$A$ consists of just one point of the sample space, but in certain
applications $A$ is a larger subset of the sample space. The function
$\Pr(A; \omega)$ of the parameter $\omega$, that is, the probability of the
observed event $A$ with respect to the distribution of the statistical model,
is called the likelihood function of the model at the observed event $A$ and
is denoted by $L_{\mathcal{M}}(\omega; A)$. If the statistical model consists
of discrete distributions and the observed event has the form $A = \{y\}$, the
likelihood function is simply the point probability function $p(y; \omega)$ as
a function of $\omega$. If, however, the statistical model consists of
absolutely continuous distributions, the measurement accuracy $\delta$ of the
response has to be taken into account, and so the event has the form
$A = \{y' : y - \delta < y' < y + \delta\}$, where $y'$ denotes a generic value
of the response and $y$ the recorded value. Assuming for simplicity that the
response is one-dimensional, the likelihood function is

    $L_{\mathcal{M}}(\omega; y) = \Pr(A; \omega)$
        $= \Pr(y - \delta < y' < y + \delta; \omega)$
        $\approx p(y; \omega) 2\delta$,

which depends on $\omega$ only through the density function. Suppose that the
response is transformed by a smooth one-to-one function $h$ and denote
$z = h(y)$. Let $g$ be the inverse of $h$. Then with respect to the transformed
response

    $L_{\mathcal{M}}(\omega; z) = \Pr(z - \delta < z' < z + \delta; \omega)$
        $= \Pr(g(z - \delta) < y' < g(z + \delta); \omega)$
        $\approx p(y; \omega) 2 |g'(h(y))| \delta$,

which differs from the previous expression, but again depends on $\omega$ only
through the original density function. From this it follows that in the
absolutely continuous case the likelihood function is, up to a constant,
approximately equal to the density function considered as a function of the
parameter. The definition of the likelihood function which applies exactly to
the discrete case and approximately to the absolutely continuous case is

(2.3)    $L_{\mathcal{M}}(\omega; y) = c(y) p(y; \omega)$,

where $c$ is an arbitrary positive function. The above discussion generalizes
to models with higher dimensional parameter spaces. There are, however,
situations in which the above approximation may fail (Lindsey 1996, pp.
75-80). In these situations it is wise to use the exact likelihood function,
which is based on the actual measurement accuracy.
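The quality of the approximation behind (2.3) can be examined numerically. The
Python sketch below assumes, purely for illustration, a single exponential
response with mean $\mu$ recorded with accuracy $\delta$; the function names
are hypothetical. It compares the exact probability of the observed interval
with the density approximation $p(y; \mu) 2\delta$.

```python
import math


def exact_interval_likelihood(mu, y, delta):
    """Pr(y - delta < Y < y + delta; mu) for an exponential response with mean mu."""
    lo = max(y - delta, 0.0)
    hi = y + delta
    return math.exp(-lo / mu) - math.exp(-hi / mu)


def density_likelihood(mu, y, delta):
    """Approximation p(y; mu) * 2 * delta based on the density at y."""
    return math.exp(-y / mu) / mu * 2.0 * delta


def relative_error(mu, y, delta):
    """Relative error of the density approximation at the parameter value mu."""
    exact = exact_interval_likelihood(mu, y, delta)
    return abs(density_likelihood(mu, y, delta) - exact) / exact
```

In this setting the approximation is accurate when $\delta$ is small compared
with $\mu$ and deteriorates for parameter values at which the density changes
appreciably within the measurement interval, which is the kind of situation in
which the exact likelihood function is preferable.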

Occasionally the event $A$ is more complicated. For example, it might be
known only that the response is greater than some given value $y$. Then
$A = \{y' : y' > y\}$, with $y'$ a generic response value, and

    $L_{\mathcal{M}}(\omega; y) = \Pr(y' > y; \omega) = 1 - F(y; \omega)$.
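A minimal Python sketch of how such events enter a likelihood, assuming
independent exponential responses with mean $\mu$ (an illustrative assumption)
and a hypothetical function name: exactly observed times contribute the
logarithm of the density, and times known only to exceed a value contribute
the logarithm of $1 - F(t; \mu) = e^{-t/\mu}$.

```python
import math


def censored_exponential_loglik(mu, times, censored):
    """Log-likelihood mixing exact observations and right-censored ones."""
    total = 0.0
    for t, is_censored in zip(times, censored):
        if is_censored:
            total += -t / mu                 # ln(1 - F(t; mu)) = -t / mu
        else:
            total += -math.log(mu) - t / mu  # ln p(t; mu)
    return total


# Hypothetical data: two exact times and two times known only to be exceeded.
times = [2.0, 3.0, 5.0, 7.0]
censored = [False, True, False, True]

# Under this model the maximizing value is (total time) / (number of exact times).
mu_hat = sum(times) / sum(1 for c in censored if not c)
```

Under these assumptions the log-likelihood reduces to
$-d \ln(\mu) - T/\mu$, where $d$ is the number of exactly observed times and
$T$ the sum of all recorded times.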

For theoretical reasons, which will be discussed later, instead of the
likelihood function it is better to use the logarithmic likelihood function or
log-likelihood function, that is, the function

(2.4)    $l_{\mathcal{M}}(\omega; y) = \ln(L_{\mathcal{M}}(\omega; y))$.

Often important information about the behavior of the likelihood and
log-likelihood functions can be obtained from the first and second order
derivatives of the log-likelihood function with respect to the components of
the parameter vector. The vector of the first derivatives is called the score
function and the negative of the matrix of second derivatives is called the
observed information function. Practically and theoretically the point in the
parameter space that maximizes the value of the likelihood is of special
importance.
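For a one-parameter model these quantities are easy to write down. The Python
sketch below assumes a sample of $n$ independent exponential responses with
mean $\mu$ and total $s$, for which $l(\mu) = -n \ln(\mu) - s/\mu$; the model
choice and the function names are illustrative assumptions. A central
difference is included as a check on the analytic score.

```python
import math


def loglik(mu, n, s):
    """Log-likelihood of n exponential responses with total s and mean mu."""
    return -n * math.log(mu) - s / mu


def score(mu, n, s):
    """First derivative of the log-likelihood with respect to mu."""
    return -n / mu + s / mu ** 2


def observed_information(mu, n, s):
    """Negative second derivative of the log-likelihood with respect to mu."""
    return -n / mu ** 2 + 2.0 * s / mu ** 3


def numerical_score(mu, n, s, h=1e-5):
    """Central-difference approximation of the score, used only as a check."""
    return (loglik(mu + h, n, s) - loglik(mu - h, n, s)) / (2.0 * h)
```

The score vanishes at $\hat{\mu} = s/n$, and the observed information at that
point equals $n/\hat{\mu}^2$, which describes the curvature of the
log-likelihood at its maximum.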

Let $y_{\mathrm{obs}}$ be the observed response and
$\mathcal{M} = \{p(y; \omega) : y \in \mathcal{Y}, \omega \in \Omega\}$ its
statistical model. The likelihood ratio
$L(\omega_1; y_{\mathrm{obs}}) / L(\omega_2; y_{\mathrm{obs}})$ measures how
much more or less the observed response $y_{\mathrm{obs}}$ supports the value
$\omega_1$ of the parameter vector compared to the value $\omega_2$. Because
the likelihood ratio does not change if the likelihood function is multiplied
by a number not depending on the parameter vector $\omega$, it is convenient
for the sake of various comparisons to multiply it by an appropriate number. A
number generally used is the reciprocal of the maximum value of the likelihood
function. The version of the likelihood function obtained in this way is called
the relative likelihood function and it has the form

(2.5)    $R(\omega; y_{\mathrm{obs}}) = \frac{L(\omega; y_{\mathrm{obs}})}{L(\hat{\omega}; y_{\mathrm{obs}})}$,

where $\hat{\omega}$ is the value of the parameter vector maximizing the
likelihood function. The relative likelihood function takes values between 0
and 1 and its maximum value is 1. The logarithmic relative likelihood function
or relative log-likelihood function is the logarithm of the relative likelihood
function and thus has the form

(2.6)    $r(\omega; y_{\mathrm{obs}}) = l(\omega; y_{\mathrm{obs}}) - l(\hat{\omega}; y_{\mathrm{obs}})$.

The relative log-likelihood function has its values in the interval
$(-\infty, 0]$ and its maximum value is 0.

Table 1
Leukemia data.

Treatment  Time of remission
Drug       6+, 6, 6, 6, 7, 9+, 10+, 10, 11+, 13, 16, 17+, 19, 20+, 22,
           23, 25+, 32+, 32+, 34+, 35+
Control    1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23

Fig 1. Leukemia data: Times of remission in leukemia data. Symbol o denotes
times of the control group, □ uncensored times and ■ censored times of the
drug group.

Example. The data in Table 1 show times of remission (i.e. freedom from
symptoms in a precisely defined sense) of leukemia patients, some patients
being treated with the drug 6-mercaptopurine (6-MP), the others serving
as a control (Cox and Oakes 1984, p. 7). The columns of the data matrix
contain values of treatment and time of remission. Censored times are
denoted by a + sign. Figure 1 shows the times of remission.

As an example consider the 'Control' group of the leukemia data. Assume
that the times of remission are a sample from some exponential distribution
with unknown mean $\mu$. Statistical evidence consists of the response vector

$y_{\mathrm{obs}}$ = (1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23)

and statistical model

$\mathcal{M}$ = SamplingModel[ExponentialModel[$\mu$], 21]

with sample space $\mathcal{Y} = \mathbb{R}^{21}$, parameter space
$\Omega = (0, \infty)$, and model function

    $p(y; \mu) = \mu^{-21} e^{-21 \bar{y} / \mu}$,

where $\bar{y}$ denotes the sample mean. The likelihood function has the form

    $L(\mu; y_{\mathrm{obs}}) = \frac{1}{\mu^{21}} e^{-182/\mu}$

and the log-likelihood function the form

    $l(\mu; y_{\mathrm{obs}}) = -21 \ln(\mu) - \frac{182}{\mu}$.

The relative likelihood and relative log-likelihood functions are

    $R(\mu; y_{\mathrm{obs}}) = \left(\frac{26}{3\mu}\right)^{21} e^{21 - 182/\mu}$

and

    $r(\mu; y_{\mathrm{obs}}) = 21 \ln\left(\frac{26}{3}\right) - 21 \ln(\mu) + 21 - \frac{182}{\mu}$,

respectively. Figures 2 and 3 contain graphs of these functions.

Fig 2. Leukemia data: Relative likelihood function.

imsart-aap ver. 2008/01/24 file: ejs_2008_277.tex date: July 22, 2008
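The quantities of this example can be reproduced with a few lines of Python.
The sketch below uses the response vector displayed above (21 times with total
182) and the exponential model of the example. The function names, the
bisection helper, and the cutoff 0.15 used to illustrate a likelihood interval
are assumptions made here and are not taken from the article.

```python
import math

y_obs = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
n, total = len(y_obs), sum(y_obs)  # 21 observations with total 182
mu_hat = total / n                 # maximizing value, 26/3


def rel_loglik(mu):
    """Relative log-likelihood r(mu; y_obs) of the exponential model."""
    return n * math.log(mu_hat / mu) + n - total / mu


def rel_lik(mu):
    """Relative likelihood R(mu; y_obs)."""
    return math.exp(rel_loglik(mu))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def likelihood_interval(cutoff):
    """End points of the set {mu : R(mu; y_obs) >= cutoff}."""
    target = math.log(cutoff)

    def shifted(mu):
        return rel_loglik(mu) - target

    return bisect(shifted, 1e-6, mu_hat), bisect(shifted, mu_hat, 1e3)


lower, upper = likelihood_interval(0.15)  # illustrative cutoff only
```

The interval returned by `likelihood_interval` is the set of values of $\mu$
whose relative likelihood is at least the chosen cutoff, that is, the region
one would read off from a graph of the relative likelihood function.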

2.2. Problem of estimation. 

Fig 3. Leukemia data: Relative log-likelihood function.

Meaning of 'estimate'. Fisher developed his theory of statistical estimation
during 1912-56, first inventing important concepts in his first four
theoretical articles (Fisher 1912, 1915, 1920, 1921). In these articles the
'problem of estimation' is not explicitly mentioned, but it will be argued
that the problem and its solution appear in the last of these articles. In
considering the development of Fisher's ideas and ideology it is important
to separate concepts from the words Fisher used to name them. His terminology
developed during these years and stabilized after ten years in Fisher
(1922). The concepts, however, already appear during 1912-1921 in almost
completely 'modern' form. Table 2 contains information on the development
of concepts and their terminology for some important cases.

The problem of estimation is defined on page 49 of Fisher (1956; see also
1921, p. 3; 1922, p. 313; 1925a, p. 8; 1925b, p. 701; 1935, p. 40).

. . . when the general hypothesis is found to be acceptable, and accepting it as 
true, we proceed to the next step of discussing the bearing of the observational 
record upon the problem of discriminating among the various possible values 
of the parameter, we are discussing the theory of estimation itself. 

Already one should note that Fisher does not speak of picking one parameter
value, but instead seems to be thinking of division of possible parameter
values into sets.

Table 2
Development of concepts and their terminology.

Modern term            1912                         1915                       1922
Parameter              arbitrary elements           -                          parameter
Model function         function of known form       frequency distribution of  hypothetical infinite
                                                    the population             population
Statistic              -                            statistical derivate       statistic
Sampling distribution  frequency curve              frequency distribution     distribution of statistics
                                                                               derived from samples
Likelihood             inverse probability          -                          likelihood
Likelihood function    inverse probability system   -                          -
MLE                    most probable set of values  most likely value          optimum value of parameter

In his first four articles (Fisher 1912, 1915, 1920, 1921) Fisher developed
his solution, first introducing the concepts of likelihood and likelihood
function in Fisher (1912). At the end of Fisher (1921, p. 25) he writes

"Probable errors" attached to hypothetical quantities should not be
interpreted as giving any information as to the probability that the quantity
lies within any particular limits. When the sampling curves are normal and
equivariant the "quartiles" obtained by adding and subtracting the probable
error, express in reality the limits within which the likelihood exceeds
0.796,542, within twice, thrice, four times the probable error the values of
the likelihood exceed 0.402,577, 0.129,098, and 0.026,267; within once, twice,
and thrice the standard error, they exceed 0.606,051, 0.135,335 and 0.011,109.

Professor Anthony Edwards has checked the numbers and has noted that in the
case of the first standard error there should be 0.606,531 instead of
0.606,051. The other numbers are correct.
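The likelihood levels in this quotation can be checked directly. For a normal
sampling curve with constant standard error the relative likelihood at $z$
standard errors from the maximum is $e^{-z^2/2}$, and the probable error is the
upper quartile of the standard normal distribution, about 0.6745 standard
errors. The Python sketch below, with hypothetical names, evaluates these
levels using only the standard library.

```python
import math
from statistics import NormalDist


def normal_relative_likelihood(z):
    """Relative likelihood at z standard errors from the maximum (normal case)."""
    return math.exp(-0.5 * z * z)


# The probable error expressed in standard errors: the 75 % point of N(0, 1).
PROBABLE_ERROR = NormalDist().inv_cdf(0.75)

# Levels at one to four probable errors and at one to three standard errors.
probable_error_levels = [
    normal_relative_likelihood(k * PROBABLE_ERROR) for k in (1, 2, 3, 4)
]
standard_error_levels = [normal_relative_likelihood(k) for k in (1, 2, 3)]
```

The computed values agree with the figures quoted from Fisher (1921) to within
rounding in the last printed digits, and the level at one standard error,
$e^{-1/2}$, rounds to 0.606,531, in line with the correction noted by Edwards.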

A solution of the problem of estimation is given on pages 69-72 of Fisher
(1956; see also 1922, p. 327; 1956, p. 53). On page 72 Fisher writes

In the case under discussion a simple graph of the values of the Mathematical 
Likelihood expressed as a percentage of its maximum, against the possible 
values of the parameter p, shows clearly enough what values of the parameter 
have likelihoods comparable with the maximum, and outside of what limits 
the likelihood falls to levels at which the corresponding values of the parameter 
become implausible. 

These quotations show that for Fisher the solution of the problem of esti- 
mation consisted of a collection of likelihood regions, that is, an inferential 
statement states that the unknown parameter (vector) belongs to a likeli- 
hood interval (region). Aldrich (1997) discusses the first quote and makes 
the same inference that Fisher was speaking about likelihood intervals. 

This is, however, the solution only when the whole parameter vector is
of interest, and even in that case it lacks the all-important assessment of
the uncertainty of the statement. Fisher (1915, 1920, 1921) considered the
case of real-valued interest functions, namely the correlation coefficient
(1915, 1921) and the standard deviation (1920). His solution in these articles
was the construction of a statistic whose sampling distribution depended only
on the interest function. The statistics chosen were naturally the sample
correlation coefficients and the sample standard deviation.

At the beginning of his statistical writing Fisher wanted to express
statistical inference, that is, statements concerning the unknown value of the
interest function, in the form (1.1). There are at least three natural reasons
for this. First, Fisher himself had learned statistics via the Bayesian
paradigm and so must have had a strong inclination to use (1.1). Secondly,
this way of stating the inference should be easier for others to understand.
Thirdly, it gave a way to assess the uncertainty of the statement. Using (1.1)
to express the inference was nevertheless very problematic, because people
tended to think Fisher was just applying the Bayesian paradigm.

In addition to the problem of convincing others that he was not using the
Bayesian paradigm, in which he did not succeed well, Fisher had a couple
of other problems. The first was the question of the interpretation of (1.1).
The Bayesian interpretation treats $\theta$ as random and $\hat{\theta}$ as
fixed. Fisher instead treated $\theta$ as fixed and $\hat{\theta}$ as random.
The second problem was that the interval

(2.7)    $\hat{\theta} \pm k \, \mathrm{se}_{\hat{\theta}}$

for some real number $k$ is a likelihood interval only if $\hat{\theta}$ has a
normal distribution with constant standard deviation, that is, if
$\mathrm{se}_{\hat{\theta}}$ does not depend on $\theta$. Now, however, the
invariance property of likelihood, which perhaps was the main motivation in
Fisher (1912), came to the rescue. Thus Fisher sought transformations
$\psi = g(\theta)$ and $\hat{\psi} = g(\hat{\theta})$ such that $\hat{\psi}$
had an (approximately) normal distribution with (approximately) constant
standard deviation. Then he constructed a likelihood interval for $\psi$ using
(2.7) and transformed that to a likelihood interval for $\theta$, that is,

(2.8)    $g^{-1}(\hat{\psi} \pm k \, \mathrm{se}_{\hat{\psi}})$.

At the same time, the problem of assessment of uncertainty was dealt with by
choosing the real number $k$ to be some quantile of the standard normal
distribution. It must be admitted, however, that Fisher in his earlier articles
used the expression

(2.9)    $\hat{\psi}$ ± probable error of $\hat{\psi}$

and later the expression

(2.10)    $\hat{\psi}$ ± standard error of $\hat{\psi}$.
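The construction (2.8) can be sketched for the correlation coefficient. The
Python example below assumes the transformation
$g(\theta) = \mathrm{artanh}(\theta)$ and the commonly used approximate
standard error $1/\sqrt{n - 3}$ on the transformed scale. Both are assumptions
made for this illustration, since the text above does not specify $g$ or the
standard error, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def correlation_interval(r, n, level=0.95):
    """Interval g^{-1}(psi_hat +/- k * se) for a correlation coefficient.

    Assumes g = artanh and se = 1 / sqrt(n - 3) on the transformed scale.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    k = NormalDist().inv_cdf(0.5 + level / 2.0)  # standard normal quantile
    psi_hat = math.atanh(r)                      # estimate on the transformed scale
    se = 1.0 / math.sqrt(n - 3)                  # assumed constant standard error
    return math.tanh(psi_hat - k * se), math.tanh(psi_hat + k * se)
```

The interval is symmetric around $\hat{\psi}$ on the transformed scale but
asymmetric around the sample correlation on the original scale, which is what
a single symmetric expression of the form (2.7) for the correlation itself
cannot convey.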

In his fourth article (Fisher 1921) on page 12 Fisher writes 
It would of course be possible to render the statement of the possible error of
r less misleading by writing [...] instead of ±0.1213; that is by stating
the actual quartile distances. Such a change, though certainly more accurate,
and giving at any rate a danger signal as to the nature of the distribution,
does not describe it effectively. Although two numbers are given, they contain
less information than a single probable error when the distribution is normal;

That he interpreted (2.9) and (2.10), along with the corresponding expressions
for $\theta$, as collections of likelihood intervals he made clear at the end
of Fisher (1921). That the expressions (2.9) and (2.10) implicitly contained an
assessment of the uncertainty of statements is evident.

Now the above explanation of what Fisher meant by the 'problem of 
estimation' and what was his solution to that problem explains his termi- 
nology, that is, the meaning he gave to the word 'estimate'. Thus for Fisher 
'estimate', 'solution of estimation', and 'problem of estimation' meant the 
following. 

Estimate consists of the observed values of some set of statistics which
    jointly define a function from sample space to parameter space.

Solution of estimation consists of determination of likelihood regions from
    the observed likelihood function based on the sampling distribution of
    those statistics.

Problem of estimation consists of deriving a method that produces best
    estimates.

So even though the observed value of an 'estimate' belonged to the param- 
eter space that estimate for Fisher was not a point estimate. Instead this 
observed value of the estimate and its sampling distribution were carriers of 
information to be used in statistical inference. In the discussion of Savage 
(1976) Oscar Kempthorne expresses the same opinion 

Savage alluded, appropriately, to obscurity on what Fisher meant by "esti- 
mation." My guess is that he meant the replacement of the data by a scalar 
statistic T for the scalar parameter $\theta$ which contained as much as possible of
the (Fisherian) information on $\theta$ in the data. But what one should do with an
obtained T was not clear, though Fisher was obviously not averse at times to
regarding T as an estimator of $\theta$. It is interesting, as Savage noted, that Fisher
was the first to formulate the idea of exponential families in this connection. 
Here, also, the fascinating question of ancillaries arises, and on this Fisher was 
most obscure. 

and similarly Ian Hacking in his book 'Logic of Statistical Inference' (1965, 
p. 173). 

But in Fisher's opinion, an estimate aims at being an accurate and extremely 
brief summary of the data bearing on the true value of some magnitude. 
Closeness on the true magnitude seems to be conceived as a kind of incidental
feature of estimates.

Thus from Fisher's point of view, he is quite correct when he writes 'if an
unknown parameter $\theta$ is being estimated, any one-valued function of $\theta$ is
necessarily being estimated by the same operation. The criteria used in the theory
must, for this reason, be invariant for all such functional transformations of 
the parameters'. He is right, for by an estimate he means a summary of the 
data. 

It is often a trifle hard to see why what Fisher calls estimates should in any 
way resemble what are normally called estimates. And in fact his estimates 
based on small samples consist of a set of different numbers, only one of which 
resembles what we call an estimate, and the others of which are what he calls 
'ancillary statistics', namely supplementary summaries of information. 

Both Kempthorne and Hacking consider the case of a one-dimensional parameter.
In fact, even though the above formulation of the 'problem of estimation'
also includes the case of a higher dimensional parameter, it seems that Fisher
never considered confidence regions in higher dimensional parameter spaces.
What Kempthorne considered 'obscure' is no longer so if the word
'estimate' is interpreted in the way presented above and Fisher's solution to
the 'problem of estimation', which he gave at the end of Fisher (1921), is
accepted. Even the 'question of ancillaries' has a natural explanation that
will be considered in the next subsection.

Solution of the problem of estimation. When comparing the mean error

(2.11)    $\sqrt{\frac{\pi}{2}} \sum_{i=1}^{n} |y_i - \bar{y}| / n$

with the square root of the mean square error

(2.12)    $\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 / n}$

as estimates (in Fisher's meaning) of the standard deviation in the case of a
sample from a normal distribution with unknown mean and variance, Fisher hit,
it may be said almost accidentally, on the concept of sufficiency (Fisher 1920,
p. 768). The modern definition of sufficiency is the following.

Sufficiency A statistic $S$ is sufficient for the parameter vector $\omega$ if
    the conditional distribution of every other statistic $T$ given $S$
    is independent of the parameter vector $\omega$, that is, if the likelihood
    function obtained from the sampling distribution of $S$ is identical to
    the likelihood function obtained from the observed response and its
    statistical model.
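The second form of the definition can be illustrated numerically. The Python
sketch below assumes independent exponential responses with mean $\mu$, for
which the sum $S$ of the responses has a gamma distribution with shape $n$ and
scale $\mu$. The model and the function names are illustrative assumptions. The
difference between the log-likelihood of the full sample and the
log-likelihood obtained from the sampling distribution of $S$ is computed at
several parameter values.

```python
import math


def full_sample_loglik(mu, y):
    """Log-likelihood of the whole sample of exponential responses with mean mu."""
    n = len(y)
    return -n * math.log(mu) - sum(y) / mu


def sum_statistic_loglik(mu, s, n):
    """Log-density at s of S = y_1 + ... + y_n, a Gamma(shape n, scale mu) variable."""
    return (n - 1) * math.log(s) - s / mu - n * math.log(mu) - math.lgamma(n)


def loglik_difference(mu, y):
    """Difference of the two log-likelihoods at the parameter value mu."""
    return full_sample_loglik(mu, y) - sum_statistic_loglik(mu, sum(y), len(y))
```

Under these assumptions the difference does not depend on $\mu$, so the two
likelihood functions differ only by a factor free of the parameter and yield
the same relative likelihood function.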
For Fisher the term sufficient statistic, however, had a slightly different
meaning. In addition to the term sufficient statistic Fisher also used the
term exhaustive statistic, and these had the following definitions.

Sufficient statistic is a statistic $S$ with dimension equal to that of the
    parameter vector $\omega$ such that the conditional distribution of every
    other statistic $T$ given $S$ is independent of the parameter vector
    $\omega$, that is, a statistic $S$ such that the likelihood function
    obtained from the sampling distribution of $S$ is identical to the
    likelihood function obtained from the observed response and its
    statistical model.

Exhaustive statistic is a statistic $E$ which is sufficient in the modern
    meaning of the word such that it can be written in the form $E = (S, A)$,
    where $S$ has the same dimension as the parameter vector $\omega$ and the
    sampling distribution of $A$ is independent of the parameter vector
    $\omega$, that is, the likelihood function obtained from the conditional
    sampling distribution of $S$ given the observed value of $A$ is identical
    to the likelihood function obtained from the observed response and its
    statistical model.

Thus for Fisher a $d$-dimensional statistic $S$ is sufficient if it is
sufficient in the modern meaning of the word. Now, because Fisher's estimate
meant a $d$-dimensional statistic taking values in the parameter space, he
used in fact the term sufficient estimate.

In both cases Fisher's aim was an estimate (Fisher's meaning) such that 
the likelihood function obtained from the conditional sampling distribution 
of the estimate will exhaust all the information on the parameter vector 
contained in the statistical evidence. In case of Fisher's sufficiency there is 
no need to condition and so the estimate is sufficient because it exhausts all 
the information. 

In Fisher (1920) it was shown that the square root of the mean square error
is the sufficient estimate for the standard deviation.
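The comparison of (2.11) and (2.12) can be explored by simulation. The Python
sketch below draws normal samples and compares the relative variability
(variance divided by squared mean) of the two estimates of the standard
deviation. The sample size, the number of replications, the seed, and the
function names are arbitrary choices made for this illustration, and the
result is a Monte Carlo approximation only.

```python
import math
import random
from statistics import fmean, pvariance


def mean_error_estimate(y):
    """Estimate (2.11): sqrt(pi / 2) times the mean absolute deviation."""
    ybar = fmean(y)
    return math.sqrt(math.pi / 2.0) * fmean([abs(v - ybar) for v in y])


def rms_error_estimate(y):
    """Estimate (2.12): square root of the mean square deviation."""
    ybar = fmean(y)
    return math.sqrt(fmean([(v - ybar) ** 2 for v in y]))


def relative_variances(n=20, replications=5000, sigma=1.0, seed=1):
    """Monte Carlo variance / mean^2 of both estimates over normal samples."""
    rng = random.Random(seed)
    mean_err, rms_err = [], []
    for _ in range(replications):
        y = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean_err.append(mean_error_estimate(y))
        rms_err.append(rms_error_estimate(y))
    rel_mean = pvariance(mean_err) / fmean(mean_err) ** 2
    rel_rms = pvariance(rms_err) / fmean(rms_err) ** 2
    return rel_mean, rel_rms
```

In such simulations the estimate based on the mean square error shows the
smaller relative variability, in line with its being the sufficient estimate
for the standard deviation of a normal sample.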

2.3. Theory of statistical estimation. 

Background. Fisher (1922) is the first of Fisher's articles that considered
the problem of estimation in a systematic way, and it contains the first version
of a chain of ideas that Fisher called the Theory of statistical estimation and
which was meant to justify his solution of the problem of estimation. The
previous four articles (1912, 1915, 1920, 1921) already contained, however, the
seeds of these ideas and especially four basic 'principles' that form
Fisher's 'ideology', which he kept talking about for the rest of his life. The
principles were the following.

No prior distribution. Statistical inference on unknown parameter(s) should
be done without assuming any prior distribution for the unknown
parameters, unless the prior distribution is based on real physical
knowledge of the situation.

Invariance. Statistical inference should be invariant with respect to
one-to-one transformations of parameters.

Efficiency. Statistical inference should be efficient so that all the
information in the observations is used in an 'optimal' way.

Small samples. Statistical inference should be exactly applicable in small
samples.

In the early papers, these principles were more Fisher's reactions against
certain common ways of thinking. The first principle was based on the writings
of Boole (1854), Venn (1866), etc. and expresses Fisher's conviction
that "the theory of inverse probability is founded upon an error, and must
be wholly rejected" (Fisher 1925a, p. 9). Here the term inverse probability
refers to the Laplacean way of using 'uniform' prior distributions when there is
no knowledge about the parameters. It must, however, be said that Fisher
did not reject Bayesian inference altogether. On the contrary, he regarded it
highly, but insisted that it was applicable only when the prior distribution
could be based on real physical knowledge and not on the lack of it.

Fisher's favorable thoughts about Bayes and Bayesian inference have an 
explanation in his ideas about the uncertainty of statistical inference. In 
Fisher (1956, p. 40) he writes the following. 

While, as Bayes perceived, the concept of Mathematical Probability affords 
a means, in some cases, of expressing inferences from observational data, in- 
volving a degree of uncertainty, and of expressing them rigorously, in that 
the nature and extent of the uncertainty is specified with exactitude, yet it 
is by no means axiomatic that the appropriate inferences, though in all cases 
involving uncertainty, should always be rigorously expressible in terms of this 
same concept. 

In this quotation the term Mathematical Probability means a probability
model, that is, a sample space of outcomes, a collection of events, and a
probability measure defined on the collection of events giving the
distribution. As the quotation indicates, according to Fisher there are different
situations which afford different kinds of measures of uncertainty.
Mathematical Probability is at the top of these measures of uncertainty. This
explains Fisher's positive attitude to Bayesian inference and his theory of
fiducial probability, which is an attempt to produce statistical inferences in
the form of Mathematical Probability without assuming a prior distribution
for the parameters. Other forms of uncertainty include Mathematical Likelihood,
significance levels, and confidence levels. In these latter cases, the calculated
numbers are measures of the uncertainty of the conclusions made, but they
cannot be expressed in the form of Mathematical Probability because there
is no collection of events and probability measure on it.

The principle of invariance appeared already in Fisher (1912) and in
Fisher (1956, p. 146) it is said that

... if an unknown parameter $\theta$ is being estimated, any one-valued function of
$\theta$ is necessarily being estimated by the same operation. The criteria used in
the theory must, for this reason, be invariant for all such functional
transformations of the parameters.

This principle has been the most important from the beginning of Fisher's 
statistical writing. The word 'absolute' appearing in the title of Fisher (1912) 
refers to invariance (Aldrich 1997 and Stigler 2005). Fisher used the concept 
of invariance also to downplay the importance of concepts like unbiasedness, 
etc. 

One of the reasons for writing Fisher (1912) was to present an alternative
to the method of moments and to show that the method of moments did not
generally produce efficient solutions. In Fisher (1922) he gave
a dramatic example of the failure of the method of moments by showing that
using the sample mean of a sample from a Cauchy distribution with unknown
location and known scale is equivalent to basing the
inference on just one observation and discarding all the other observations
in the sample.

From the beginning of his statistical writing Fisher wanted to develop
solutions for small-sample situations. Before Fisher came on the scene, most
methods of statistical inference were applicable only when samples were large.
The first paper on small samples was Student (1908). Fisher admired the
work of Gosset, and its influence shows in Fisher's emphasis on finite
samples and in his view that the duty of statistics and statisticians is to
provide exact methods of statistical inference that scientists can use to
analyze their data, even when the data consist of a small sample.

Fisher's logic of inductive inference. Fisher's Theory of statistical esti- 
mation is a chain of ideas intended to justify his solution of the problem of 
estimation. In Fisher (1935, p. 41) he thought it necessary to 

. . . show how it is that a consideration of the problem of estimation, without 
postulating any special significance for the likelihood function, and of course 
without introducing any such postulate as that needed for inverse probability, 
does really demonstrate the adequacy of the concept of likelihood for inductive 
reasoning, in the particular logical situation for which it has been introduced 

Stigler (2005) contains a thorough discussion of the developments that
led to Fisher (1922), in which Fisher presented the first version of the theory
of statistical estimation. Stigler's view of the motive behind the theory of
statistical estimation is perfectly in line with the views presented in this
article.

The chain of ideas in Fisher's Theory of statistical estimation naturally
divides into two parts. In Fisher (1935, p. 41) he describes this chain in the
following way.

This logical characteristic of our approach naturally requires that our edifice
shall be built in two stories. In the first we are concerned with the theory of
large samples, using this term, as is usual, to mean that nothing that we
say shall be true, except in the limit when the size of the sample is indefinitely
increased; a limit, obviously, never attained in practice. This part of the theory,
to set off against the complete unreality of its subject-matter, exploits the
advantage that in this unreal world all possible merits of an estimate may be
judged exclusively from its variability, or sampling variance. In the second story,
where the real problem of finite samples is considered, the requirement that
our estimates from these samples may be wanted as materials for a subsequent
process of estimation is found to supply the unequivocal criteria required.

In this quotation the italics are Fisher's; they are important because they
emphasize Fisher's often expressed opinion that the ideas of the first 'story'
are theoretical and were not intended to be used in finite samples.

In the first story three concepts appear, namely consistency, efficiency,
and expected information or Fisher information. Consistency means
that in the limit the estimate becomes a constant that is equal to the parameter
function of interest. Efficiency means that the asymptotic variance of the
estimate is as small as possible. Technically, as the 'sample size' n increases,
the limiting value of nV, where V stands for the variance of our estimate, shall
be as small as possible (Fisher 1935, p. 42). Fisher showed that for any
consistent estimate

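$\lim_{n \to \infty} nV \ge \frac{1}{i(\omega)}$,

where

$i(\omega) = \operatorname{var}\left(\frac{\partial \ln p(y;\omega)}{\partial\omega}\right)$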
is the Fisher information. Fisher showed that MLE is consistent and its 
limiting variance is equal to the reciprocal of Fisher information. Thus in 
the limit MLE is the best, but this result concerns what happens in an unreal 
world. 

One may ask why Fisher included the first story of large samples in his theory
of statistical estimation. One explanation might be that he wanted to
show the connection between his theory and the standard interval (1.1). It
is well known that in most cases the asymptotic distribution of an estimate
$\hat\theta$ is normal. In order that (1.1) would be a likelihood interval for
$\theta$, the mean of the asymptotic normal distribution should be exactly or at
least approximately equal to $\theta$ and its variance should be exactly or
approximately independent of $\theta$. The first requirement is true if the
estimate is consistent. The other reason may be that the concept of Fisher
information, which was introduced in the first story and had there a rather
simple interpretation, was applicable also in the second story, that is, in
small samples.

The first 'story' in Fisher's theory of statistical estimation was just a 
prelude to the discussion of the real problem in the second 'story'. In Fisher 
(1935, p. 46; see also 1925a, p. 15; 1925b, p. 712; 1956, p. 46, 158, 163) he 
says the following. 

We are now in a position to consider the real problem of finite samples. For 
any method of estimation has its own characteristic distribution of errors, not 
now necessarily normal, and therefore its own intrinsic accuracy. 

Because in finite samples the sampling distributions of the various possible
estimates can have widely different forms and can no longer be compared
with the help of the standard deviation, Fisher needed some new criterion for
comparing the 'efficiency' of different estimates. Early on he noticed that
the concept of Fisher information was general and could be determined in all
situations. So in the second 'story', that is, in the real problem of finite
samples, Fisher information replaced asymptotic variance as the tool used to
rank estimates. In Fisher (1935, p. 46; see also 1925a, p. 314, 338; 1925b, p.
709, 712, 714; 1956, p. 153, 157) he wrote

This quantity i, which is independent of our methods of estimation, evidently 
deserves careful consideration as an intrinsic property of the population sam- 
pled. In the particular case of error curves, or distributions of estimates of the 
same parameter, the amount of information of a single observation evidently 
provides a measure of the intrinsic accuracy with which it is possible to eval- 
uate that parameter, and so provides a basis for comparing the accuracy of 
error curves which are not normal, but may be of quite different forms. 

Fisher showed that

$0 \le i_{\hat\theta}(\theta) \le i(\theta)$,

where $\hat\theta$ is some estimate of $\theta$. Also $i(\theta)$ is the Fisher
information calculated from the original data and $i_{\hat\theta}(\theta)$ is
that calculated from the sampling distribution of the estimate $\hat\theta$.
When $\hat\theta$ is a sufficient estimate its Fisher information equals that of
the original data. In this case the solution of the problem
of estimation consists of the likelihood function obtained from the sampling
distribution of the maximum likelihood estimate evaluated at the observed
value of the MLE. But from sufficiency it follows that this likelihood function
is just equivalent to the likelihood function obtained from the original data.
In Fisher (1935, p. 47) it is said that 

Having obtained a criterion for judging the merits of an estimate in the real 
case of finite samples, the important fact emerges that, though sometimes the 
best estimate we can make exhausts the information in the sample, and is 
equivalent for all future purposes to the original data, yet sometimes it fails to 
do so, but leaves a measurable amount of the information unutilized. How can 
we supplement our estimate so as to utilize this too? It is shown that some, 
or sometimes all the lost information may be recovered by calculating what
I call ancillary statistics, which themselves tell us nothing about the value of
the parameter, but instead, tell us how good an estimate we have made of it. 

In Fisher (1922) Fisher first thought that the MLE is always a sufficient
estimate, but realized almost immediately that this is not true. When the MLE
is not sufficient,

$i_{\hat\theta}(\theta) < i(\theta)$,

and the likelihood function constructed from the sampling distribution of the
MLE does not contain all the relevant information about the unknown parameter.
So Fisher had a serious problem of how the lost information could be recovered.
In Fisher (1934) he found two important special cases in which the
lost information could be recovered by considering the conditional sampling
distribution of the MLE given the observed values of certain ancillary
statistics, that is, statistics the marginal distribution of which does not
depend on the unknown parameter. He showed that the expected value of the
conditional Fisher information calculated from the conditional sampling
distribution of the MLE is equal to the Fisher information of the original data.
Thus also in this case the solution of the problem of estimation consists just
of the original likelihood function, because the likelihood function obtained
from the conditional sampling distribution of the MLE is equal to the original
likelihood function.

The net result of Fisher's Theory of statistical estimation is that the solution
of the problem of estimation consists of likelihood intervals obtained from
the likelihood function of the data. After discussing the role of ancillary
statistics, Fisher (1956, p. 161) gives the following summary.

... it is the Likelihood function that must supply all the material for estima- 
tion, and that the ancillary statistics obtained by differentiating this function 
are inadequate only because they do not specify the function fully. 

2.4. Example. As an example, Fisher (1925, p. 705) contained the comparison
of the sample mean and median as estimates in the sense Fisher meant
by estimates. Assume that the response vector is a sample of size n = 11
from some normal distribution with unknown mean $\mu$ and known standard
deviation $\sigma = 1$. The log-likelihood function $l_1$ calculated from the
sampling distribution of the sample mean $\bar y$ has the form

$l_1(\mu) = -n(\bar y - \mu)^2/2$

and the log-likelihood function $l_2$ calculated from the sampling distribution
of the sample median $\tilde y$ has the form

$l_2(\mu) = -(\tilde y - \mu)^2/2 + (n-1)\ln(\Phi(\tilde y - \mu))/2 + (n-1)\ln(1 - \Phi(\tilde y - \mu))/2$.

Fig 4. Log-likelihood functions of mean obtained from sampling distribution of sample
mean (continuous curve) and sample median (dashed curve).




Because here the purpose is to compare the shape of these log-likelihood
functions, assume without restriction that $\bar y = \tilde y = 0$. Figure 4
shows the graphs of these log-likelihood functions and from these it is clearly
seen that the sample mean contains more information on $\mu$ than the sample
median. In fact, as is well known, the sample mean is in this model a sufficient
statistic, both in Fisher's and the modern meaning, and thus contains all the
information on $\mu$ there is in the statistical evidence. The sample median
instead loses some of this information and is an inefficient estimate. The
asymptotic variances of the estimates are $\sigma^2/n$ and $\pi\sigma^2/(2n)$,
respectively. The asymptotic efficiency of the sample median is thus
$2/\pi$ = 63.66%.
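
The comparison in this example can be checked numerically. The following is a minimal sketch, assuming n = 11, unit standard deviation and both estimates observed at 0, as in the text; the function names are chosen here for illustration only and do not come from the original article.

```python
import math

def phi_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def loglik_mean(mu, n=11, ybar=0.0):
    """Log-likelihood of mu from the sampling distribution of the sample mean."""
    return -n * (ybar - mu) ** 2 / 2.0

def loglik_median(mu, n=11, ymed=0.0):
    """Log-likelihood of mu from the sampling distribution of the sample median (n odd)."""
    d = ymed - mu
    m = (n - 1) / 2.0
    return -d ** 2 / 2.0 + m * math.log(phi_cdf(d)) + m * math.log(1.0 - phi_cdf(d))

def relative(loglik, mu):
    """Log-likelihood standardized to have maximum 0 at mu = 0."""
    return loglik(mu) - loglik(0.0)

# The median-based curve falls off more slowly, i.e. it carries less information.
for mu in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(mu, round(relative(loglik_mean, mu), 4), round(relative(loglik_median, mu), 4))

# Ratio of the asymptotic variances sigma^2/n and pi*sigma^2/(2n).
asymptotic_efficiency = 2.0 / math.pi
print(round(100 * asymptotic_efficiency, 2), "%")
```

At every nonzero value of the mean the standardized curve from the median lies above the one from the mean, which is the flatter dashed curve described for Figure 4.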

3. Prevailing anti-Fisherian view of MLE or Maximum Likelihood Estimate Method.

3.1. Delta method and Wald confidence interval. Let $g(\omega)$ be a given
real-valued interest function with $\psi$ as value. The most often used
procedure to construct a confidence interval for $\psi$ is the so-called delta
method. The delta-method interval has the form

(3.1)   $\psi \in \hat\psi \mp z^*_{\alpha/2} \sqrt{\frac{\partial g(\hat\omega)}{\partial\omega}^{T} J(\hat\omega)^{-1} \frac{\partial g(\hat\omega)}{\partial\omega}}$,

where $\hat\psi = g(\hat\omega)$ is the maximum likelihood estimate of $\psi$,
$J(\hat\omega)$ the observed information matrix, and $z^*_{\alpha/2}$ the
$(1 - \alpha/2)$-quantile of the standard normal distribution. When the interest
function consists of a component $\omega_i$ of the parameter vector $\omega$,
the above interval takes the form



(3.2)   $\omega_i \in \hat\omega_i \mp z^*_{\alpha/2}\, s_{\hat\omega_i}$,

where the standard error of estimate

(3.3)   $s_{\hat\omega_i} = \sqrt{(J(\hat\omega)^{-1})_{ii}}$

is the square root of the ith diagonal element of the inverse
$J(\hat\omega)^{-1}$. This latter confidence interval is better known by the
name Wald confidence interval.

The Wald confidence interval is derived from the asymptotic normal distribution
of the maximum likelihood estimate $\hat\omega$, that is,

(3.4)   $\hat\omega \sim \mathrm{MultivariateNormal}[d, \omega, I(\omega)^{-1}]$,

where $I(\omega)$ is the Fisher expected information matrix, which can be
estimated by the observed information matrix $J(\hat\omega)$. Thus first we have
from the asymptotic normal distribution

$\frac{\hat\omega_i - \omega_i}{\sqrt{(I(\omega)^{-1})_{ii}}} \sim \mathrm{NormalModel}[0, 1]$

and then by inserting the estimate

$\frac{\hat\omega_i - \omega_i}{\sqrt{(J(\hat\omega)^{-1})_{ii}}} = \frac{\hat\omega_i - \omega_i}{s_{\hat\omega_i}} \sim \mathrm{NormalModel}[0, 1]$,

from which the Wald confidence interval follows.

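The delta-method interval is then obtained from the Taylor-series expansion

$\hat\psi = g(\hat\omega) \approx g(\omega) + \frac{\partial g(\omega)}{\partial\omega}^{T} (\hat\omega - \omega)$,

which gives that $g(\hat\omega)$ is approximately normally distributed with mean
$\psi = g(\omega)$ and variance

$\frac{\partial g(\omega)}{\partial\omega}^{T} I(\omega)^{-1} \frac{\partial g(\omega)}{\partial\omega}$.

So

$\frac{\hat\psi - \psi}{\sqrt{\frac{\partial g(\omega)}{\partial\omega}^{T} I(\omega)^{-1} \frac{\partial g(\omega)}{\partial\omega}}} \sim \mathrm{NormalModel}[0, 1]$,

from which by replacing the denominator by its estimator we get

$\frac{\hat\psi - \psi}{\sqrt{\frac{\partial g(\hat\omega)}{\partial\omega}^{T} J(\hat\omega)^{-1} \frac{\partial g(\hat\omega)}{\partial\omega}}} \sim \mathrm{NormalModel}[0, 1]$.

It can be shown that Wald and delta-method intervals are approximations
of profile likelihood-based confidence intervals. These approximations have,
however, some serious drawbacks, the most serious of which is that they
are not invariant with respect to monotone transformations of the interest
function. Secondly, these intervals often include impossible values of the
interest function.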
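
Both drawbacks are easy to see in a one-parameter model. The following is a minimal sketch, assuming an exponential model with rate parameter (MLE 1/mean, observed information n/rate^2) and two small hypothetical samples; neither the model nor the data are taken from the article.

```python
import math
from statistics import NormalDist

def wald_interval_rate(y, level=0.95):
    """Wald interval for the rate of an exponential model."""
    n = len(y)
    rate_hat = n / sum(y)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    se = rate_hat / math.sqrt(n)          # square root of the inverse observed information
    return rate_hat - z * se, rate_hat + z * se

def delta_interval_mean(y, level=0.95):
    """Delta-method interval for the mean g(rate) = 1 / rate."""
    n = len(y)
    rate_hat = n / sum(y)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    grad = -1.0 / rate_hat ** 2           # derivative of g at the MLE
    se = math.sqrt(grad * (rate_hat ** 2 / n) * grad)
    mean_hat = 1.0 / rate_hat
    return mean_hat - z * se, mean_hat + z * se

y8 = [0.5, 1.2, 2.0, 0.3, 4.1, 0.9, 1.6, 2.4]    # hypothetical data
rate_lo, rate_hi = wald_interval_rate(y8)
mean_lo, mean_hi = delta_interval_mean(y8)
# Lack of invariance: mapping the rate interval through 1/rate gives a different interval.
print("mean interval via rate:", (1.0 / rate_hi, 1.0 / rate_lo))
print("mean interval directly:", (mean_lo, mean_hi))

y3 = [0.4, 1.1, 3.0]                             # hypothetical data
# Impossible values: with n = 3 the lower limit for a positive rate is negative.
print("rate interval, n = 3:", wald_interval_rate(y3))
```

The two intervals for the mean disagree although they describe the same parameter on two scales, and for the smaller sample the Wald interval for the rate extends below zero.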

3.2. MLE-based and likelihood-based confidence intervals. The following
quotations, from books written by authors who promote the likelihood
approach, show the general view of Fisher's 'Method of Maximum Likelihood'.
In most other theoretical books and articles on maximum likelihood
estimation the approach is similar, and even more uncompromising in the
sense that they do not contain the reservations expressed by Anthony Edwards,
James Lindsey, and Yudi Pawitan.

Edwards writes in Likelihood (1972, p. 98) 

He [Fisher] advocated what he later called the Method of Maximum Likelihood 
in his very first paper, as a means of point estimation. 

Lindsey in Parametric Statistical Inference (1996, p. 81) 

The maximum likelihood estimate can be looked upon as a point estimate. 
However, like any point estimate, in most contexts, outside of those just men- 
tioned, it is often of little use because many other models will be almost as 
likely. We need to look at the form of the whole likelihood function, . . . 

and Pawitan in In All Likelihood: Statistical Modelling and Inference Using 
Likelihood (2001, p. 30) 

Fisher (1922) introduced likelihood in the context of estimation via the method 
of maximum likelihood, but in his later years he did not think of it as simply 
a device to produce parameter estimates. 

Contrary to what Edwards, Lindsey, and Pawitan say, by the term 'Method of
Maximum Likelihood' Fisher meant that the maximum likelihood estimate gives
the solution to his 'problem of estimation', that is, likelihood/confidence
regions must be formed using the observed likelihood function obtained from
the (conditional) sampling distribution of the maximum likelihood estimate.
Pawitan (2001, p. 42) suggests the term MLE-based regions (intervals) for
those formed from the asymptotic normal distribution of the maximum likelihood
estimate, that is, for Wald confidence regions (intervals), and uses the
term likelihood-based confidence regions (intervals) for those obtained from
the likelihood function. In this article, this terminology is used.

The idea that by the Method of Maximum Likelihood Fisher meant point
estimation or confidence intervals based on an asymptotic normal distribution
of the MLE is in complete contradiction with the fact that his Theory of
Statistical Estimation was developed to justify the use of the likelihood
function in statistical inference. It is extremely odd to think that Fisher
would at the same time have been promoting two solutions to his Problem of
Estimation, especially if we take into account his lifelong emphasis on finite
samples. Every time Fisher discussed the theory of statistical estimation and
turned to the case of finite samples, he noted the unimportance of the
large-sample concepts. In Fisher (1956), for example, on page 147 he said that

In fact, the asymptotic definition is satisfied by any statistic whatsoever ap- 
plied to a finite sample, and is useless for the development of a theory of small 
samples. 

and on page 159 

The theory of large samples can, however, never be more than a first step 
preliminary to the study of samples of finite size, .... 

3.3. Example. Assume that the remission times in the control group of the
leukemia data may be modeled as a sample from some gamma distribution
and that the standard deviation of the gamma distribution is the parameter
of interest, that is,

$\psi = \mu/\sqrt{\lambda}$,

where $\lambda$ and $\mu$ are the shape and mean parameters of the gamma
distribution, respectively. Now the maximum likelihood estimate of $\psi$ is
6.763 and the approximate 95%-level profile likelihood-based and delta-method
confidence intervals for $\psi$ are (4.634, 11.297) and (3.829, 9.697),
respectively. In a simulation of 10000 samples from the gamma distribution with
shape 1.642 and mean 8.667 the actual coverage probabilities were 0.938 and
0.837 for profile likelihood-based and delta-method intervals, respectively.
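
The profile likelihood-based interval of this example can be approximated with a few lines of code. The following is only a sketch under stated assumptions: the article does not list the data, so the 21 control-group remission times commonly reproduced from the Gehan leukemia data set are assumed here (their mean is 8.667, in agreement with the text); the standard deviation is taken as mean divided by the square root of the shape; and golden-section search with bisection is a simple stand-in, not the algorithm used by the author.

```python
import math
from statistics import NormalDist

# ASSUMPTION: control-group remission times in weeks (not listed in the article).
times = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]
n = len(times)
sum_y = float(sum(times))
sum_log_y = sum(math.log(t) for t in times)

def loglik(shape, mean):
    """Gamma log-likelihood in the (shape, mean) parametrization."""
    rate = shape / mean
    return (n * (shape * math.log(rate) - math.lgamma(shape))
            + (shape - 1.0) * sum_log_y - rate * sum_y)

def golden_max(f, a, b, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
        else:                            # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
    x = (a + b) / 2.0
    return x, f(x)

def bisect_root(f, lo, hi, tol=1e-7):
    """Bisection for a root of f on [lo, hi], assuming a sign change."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Unrestricted maximum: the MLE of the mean is the sample mean.
mean_hat = sum_y / n
shape_hat, l_max = golden_max(lambda k: loglik(k, mean_hat), 0.05, 50.0)
psi_hat = mean_hat / math.sqrt(shape_hat)

def profile_loglik(psi):
    """Maximize over the shape with the mean tied to psi by mean = psi * sqrt(shape)."""
    return golden_max(lambda k: loglik(k, psi * math.sqrt(k)), 0.05, 50.0)[1]

cutoff = NormalDist().inv_cdf(0.975) ** 2 / 2.0   # half the 0.95 chi-square[1] quantile

def excess(psi):
    return profile_loglik(psi) - (l_max - cutoff)

psi_lower = bisect_root(excess, 2.0, psi_hat)
psi_upper = bisect_root(excess, psi_hat, 40.0)
print(round(psi_hat, 3), (round(psi_lower, 3), round(psi_upper, 3)))
```

If the assumed data are indeed the ones used in the article, the printed estimate and interval should land close to the values quoted above; the search brackets (0.05 to 50 for the shape, 2 to 40 for the standard deviation) are ad hoc choices for this data set.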

An explanation for the weak performance of the delta method in this case can
be seen from Figure 5, which shows the log-likelihood function of $\psi$ for the
actual data. The graph of the log-likelihood function is asymmetric and so the
profile likelihood-based confidence interval is also asymmetric with respect to
the maximum likelihood estimate $\hat\psi$. The delta-method interval is forced
to be symmetric and because of this it often misses the 'true' value of $\psi$
because of a too low upper limit. When applying the delta method, it is
important to use an appropriate scale for the interest function by first
transforming $\psi$ to $\lambda = h(\psi)$ so that the distribution of
$\hat\lambda = h(\hat\psi)$ is better approximated by a normal
distribution, especially one with approximately constant variance. Then, using
the delta-method interval $(\hat\lambda_L, \hat\lambda_U)$ for $\lambda$, a
better interval $(h^{-1}(\hat\lambda_L), h^{-1}(\hat\lambda_U))$
for $\psi$ is obtained. When applying the profile likelihood method, there is no
need, however, to make any transformations, because of the invariance of
profile likelihood-based intervals. In a sense, the profile likelihood method
automatically chooses the best transformation and the user need not worry
about that (Pawitan 2001, p. 47).

Fig 5. Log-likelihood function of standard deviation of gamma distribution calculated from
control group of leukemia data. (Vertical axis: standardized log-likelihood.)




4. Current state of Fisher's theory. 

4.1. Fisher's problem of estimation after 1956. Fisher lacked a systematic
way of getting a solution in the case of a general real-valued interest function
$g(\omega)$ of a multidimensional parameter. In Fisher (1922, p. 313) he briefly
mentions the problem.

There is one point, however, which may be briefly mentioned here in advance,
as it has been the cause of some confusion. . . . normal population of two
correlated variates will usually require five parameters for its specification,
the two means, the two standard deviations, and the correlation; of these often
only the correlation is required, or if not alone of interest, it is discussed
without reference to the other four quantities. In such cases an alteration has
been made in what is, and what is not, relevant, and it is not surprising that
certain small corrections should appear, or not, according as the other
parameters of the hypothetical surface are or are not deemed relevant.

Fisher used marginal and conditional sampling distributions of carefully
selected, but in any case somewhat ad hoc, statistics and used likelihood
functions obtained from these sampling distributions. These likelihood functions
are special cases of so-called 'pseudo likelihood functions' (Pace and Salvan
1997, pp. 131-162). The term 'pseudo likelihood function' means a function
of the interest parameter that is used instead of the original likelihood
function. Marginal and conditional likelihood functions are, however, genuine
likelihood functions, and the term 'pseudo' reflects the fact that they are used
in place of the original likelihood function and often do not contain all the
information on the interest function included in the statistical evidence. In
addition there are other pseudo likelihood functions which are not likelihood
functions at all. The prominent member of these other pseudo likelihood
functions is the profile likelihood function, which was explicitly introduced in
Box and Cox (1962), but was implicitly used in connection with Wilks' (general)
likelihood ratio statistic

$w(\psi; y_{obs}) = 2\{l(\hat\omega; y_{obs}) - l(\tilde\omega_\psi; y_{obs})\}$

for the statistical hypothesis $H : g(\omega) = \psi$, where the function

$l_P(\psi; y_{obs}) = l(\tilde\omega_\psi; y_{obs}) = \max_{\omega : g(\omega) = \psi} l(\omega; y_{obs})$

is called the logarithmic profile likelihood function or profile log-likelihood
function of $\psi = g(\omega)$.

As discussed above the MLE-based intervals are against Fisher's ideology, 
because without appropriate transformations the intervals are not likelihood 
intervals. On the other hand likelihood-based intervals calculated from (pro- 
file) likelihood function are in accordance with Fisher's ideology. 

4.2. Profile likelihood function. Let $y_{obs}$ be the observed response and
$\mathcal{M} = \{p(y;\omega) : y \in \mathcal{Y}, \omega \in \Omega\}$ its
statistical model. In most practical problems only part of the parameter vector
or more generally the value of a given function of the parameter vector is of
interest.

Let $g(\omega)$ be a given interest function with a q-dimensional real vector
$\psi$ as value. Then the function

(4.1)   $L_g(\psi; y_{obs}) = \max_{\{\omega \in \Omega : g(\omega) = \psi\}} L(\omega; y_{obs}) = L(\tilde\omega_\psi; y_{obs})$

is called the profile likelihood function of the interest function g induced
by the statistical evidence $(y_{obs}, \mathcal{M})$. The value
$\tilde\omega_\psi$ of the parameter vector maximizes the likelihood function in
the subset $\{\omega \in \Omega : g(\omega) = \psi\}$ of the parameter space.
The function

(4.2)   $l_g(\psi; y_{obs}) = \max_{\{\omega \in \Omega : g(\omega) = \psi\}} l(\omega; y_{obs}) = l(\tilde\omega_\psi; y_{obs}) = \ln(L_g(\psi; y_{obs}))$

is called the logarithmic profile likelihood function or profile log-likelihood
function of the interest function g induced by the statistical evidence
$(y_{obs}, \mathcal{M})$. Furthermore functions

(4.3)   $R_g(\psi; y_{obs}) = \frac{L_g(\psi; y_{obs})}{L_g(\hat\psi; y_{obs})} = \frac{L_g(\psi; y_{obs})}{L(\hat\omega; y_{obs})}$

and

(4.4)   $r_g(\psi; y_{obs}) = l_g(\psi; y_{obs}) - l_g(\hat\psi; y_{obs}) = l_g(\psi; y_{obs}) - l(\hat\omega; y_{obs})$

are called the relative profile likelihood function and the logarithmic
relative profile likelihood function or relative profile log-likelihood function
of the interest function g, respectively. Because in this article no likelihood
functions other than actual profile likelihood functions of various interest
functions g are considered, the phrases (relative) likelihood function and
(relative) log-likelihood function of g are used.

When the interest function g is real valued the parameter vectors
$\tilde\omega_\psi$ form a curve in the parameter space. This curve is called
the profile curve of the interest function g.

4.3. Profile likelihood region and its uncertainty. With the help of the
relative likelihood function of the interest function g one can construct the
so-called profile likelihood regions. The set

$\{\psi : R_g(\psi; y_{obs}) \ge c\} = \{\psi : r_g(\psi; y_{obs}) \ge \ln(c)\}$

of values of the interest function $g(\omega)$ is the 100c% profile likelihood
region. The value $\psi = g(\omega)$ of the parameter function does not belong
to the 100c% profile likelihood region if the response is such that

$r_g(\psi; y) < \ln(c)$.

The probability of this event, calculated at a given value $\omega$ of the
parameter vector, is used as a measure of uncertainty of the statement that the
unknown value of the interest function belongs to the 100c% profile likelihood
region, provided that this probability has the same value or at least
approximately the same value for all values of the parameter vector. One minus
this probability is called the confidence level of the profile likelihood region
and the region is called the (approximate) $(1 - \alpha)$-level profile
likelihood-based confidence region for the interest function. Because under mild
assumptions concerning the interest function $g(\omega)$ and the statistical
model the random variable $-2r_g(\psi; y)$ has approximately the
$\chi^2[q]$-distribution, the set

$\{\psi : r_g(\psi; y_{obs}) \ge -\chi^2_{1-\alpha}[q]/2\} = \{\psi : -2r_g(\psi; y_{obs}) \le \chi^2_{1-\alpha}[q]\} = \{\psi : l_g(\psi) \ge l_g(\hat\psi) - \chi^2_{1-\alpha}[q]/2\}$

is the approximate $(1 - \alpha)$-level confidence region for the interest
function $g(\omega)$ (Severini, 2000). In some cases the distribution of
$-2r_g(\psi; y)$ is exactly the $\chi^2[q]$-distribution and then the set is an
exact confidence region. Sometimes the random variable $-2r_g(\psi; y)$ has some
other known distribution, usually the F-distribution.
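
The link between the likelihood level c and the approximate confidence level can be made concrete by solving ln(c) = -(chi-square quantile)/2 for c. The following is a small sketch; the helper names are hypothetical, and closed-form chi-square quantiles are used only for one and two degrees of freedom to keep the example within the Python standard library.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, q):
    """p-quantile of the chi-square distribution, closed forms for q = 1 and q = 2 only."""
    if q == 1:
        # Square of the standard normal quantile at (1 + p) / 2.
        return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2
    if q == 2:
        # Chi-square with 2 degrees of freedom is exponential with mean 2.
        return -2.0 * math.log(1.0 - p)
    raise ValueError("this sketch covers q = 1 and q = 2 only")

def likelihood_cutoff(level, q):
    """Relative-likelihood level c whose region has approximate confidence `level`."""
    return math.exp(-chi2_quantile(level, q) / 2.0)

print(round(likelihood_cutoff(0.95, 1), 4))   # about 0.1465
print(round(likelihood_cutoff(0.95, 2), 4))   # 0.05
```

So for a real-valued interest function the approximate 95%-level confidence interval is roughly the 14.7% profile likelihood interval, and for a two-dimensional interest function the approximate 95%-level region is the 5% profile likelihood region.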
5. Case of real-valued interest function.

5.1. Profile likelihood-based confidence interval. For real-valued interest
functions the profile likelihood-based confidence regions are usually intervals
and so those regions are called (approximate) $(1 - \alpha)$-level profile
likelihood-based confidence intervals. Thus the set

(5.1)   $\{\psi : r_g(\psi; y_{obs}) \ge -\chi^2_{1-\alpha}[1]/2\} = [\hat\psi_L, \hat\psi_U]$

forms the (approximate) $(1 - \alpha)$-level profile likelihood-based confidence
interval for the interest function $\psi = g(\omega)$. The statistics
$\hat\psi_L$ and $\hat\psi_U$ are called the lower and upper limits of the
confidence interval, respectively.

5.2. Calculation of profile likelihood-based confidence intervals. If the
(approximate) $(1 - \alpha)$-level profile likelihood-based confidence region of
the real-valued interest function $\psi = g(\omega)$ is an interval, its end
points $\hat\psi_L$ and $\hat\psi_U$ satisfy the relations
$\hat\psi_L < \hat\psi < \hat\psi_U$ and

(5.2)   $l_g(\hat\psi_L) = l_g(\hat\psi_U) = l_g(\hat\psi) - \chi^2_{1-\alpha}[1]/2$.

Thus $\hat\psi_L$ and $\hat\psi_U$ are roots of the equation
$l_g(\psi) = l_g(\hat\psi) - \chi^2_{1-\alpha}[1]/2$. Consequently most
applications of the profile likelihood-based intervals have
determined $\hat\psi_L$ and $\hat\psi_U$ using some iterative root finding
method. This approach involves the solution of an optimization problem at every
trial value and, depending on the method, also the calculation of the
derivatives of the logarithmic profile likelihood function at the same trial
value.

The approximate $(1 - \alpha)$-level profile likelihood confidence set for the
parameter function $\psi = g(\omega)$ satisfies the relation

$\{\psi : l_g(\psi) \ge l_g(\hat\psi) - \chi^2_{1-\alpha}[1]/2\} = \{g(\omega) : \omega \in \mathcal{R}_{1-\alpha}(y_{obs})\}$,

where

(5.3)   $\mathcal{R}_{1-\alpha}(y_{obs}) = \{\omega : l(\omega) \ge l(\hat\omega) - \chi^2_{1-\alpha}[1]/2\}$

is a likelihood region for the whole parameter vector. This result is true
because the real number $\psi$ belongs to the profile likelihood region if and
only if there exists a parameter vector $\omega^*$ such that
$g(\omega^*) = \psi$ and

$\sup_{\{\omega \in \Omega : g(\omega) = \psi\}} l(\omega) = l_g(\psi) \ge l(\omega^*) \ge l_g(\hat\psi) - \chi^2_{1-\alpha}[1]/2$,

that is, if and only if there exists a parameter vector $\omega^*$ belonging to
the likelihood region $\mathcal{R}_{1-\alpha}(y_{obs})$ such that it satisfies
the equation $g(\omega^*) = \psi$. So the number $\psi$ belongs to the profile
likelihood region if and only if it belongs to the set
$\{g(\omega) : \omega \in \mathcal{R}_{1-\alpha}(y_{obs})\}$.

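Assume now that the profile likelihood-based confidence region is an interval.
Then the end points $\hat\psi_L$ and $\hat\psi_U$ satisfy the following relations

(5.4)    $\hat\psi_L = \inf_{\omega \in \mathcal{R}_{1-\alpha}(y_{\mathrm{obs}})} g(\omega)$

and

(5.5)    $\hat\psi_U = \sup_{\omega \in \mathcal{R}_{1-\alpha}(y_{\mathrm{obs}})} g(\omega)$.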

Let now the likelihood region be a bounded and connected subset of the
parameter space. If the log-likelihood and the interest functions are continuous
in the likelihood region, the latter with non-vanishing gradient, then
the profile likelihood-based confidence region is an interval $(\hat\psi_L, \hat\psi_U)$ with

(5.6)    $\hat\psi_L = g(\tilde\omega_L) = \inf_{l(\omega) = l(\hat\omega) - \chi^2_{1-\alpha}[1]/2} g(\omega)$

and

(5.7)    $\hat\psi_U = g(\tilde\omega_U) = \sup_{l(\omega) = l(\hat\omega) - \chi^2_{1-\alpha}[1]/2} g(\omega)$.

This follows from the assumptions, because they imply that the set $\mathcal{R}_{1-\alpha}(y_{\mathrm{obs}})$
is an open, connected, and bounded subset of the d-dimensional space. Thus
the closure of $\mathcal{R}_{1-\alpha}(y_{\mathrm{obs}})$ is a closed, connected, and bounded set. From the
assumptions concerning $g$ it follows that it attains its infimum and supremum
on the boundary of $\mathcal{R}_{1-\alpha}(y_{\mathrm{obs}})$ and takes every value between the infimum
and supremum somewhere in $\mathcal{R}_{1-\alpha}(y_{\mathrm{obs}})$.

Under the above assumptions the solution of the following constrained
minimization (maximization) problem

(5.8)    $\min\ (\max)\ g(\omega)$

with constraint

(5.9)    $l(\omega; y_{\mathrm{obs}}) = l(\hat\omega; y_{\mathrm{obs}}) - \chi^2_{1-\alpha}[1]/2$

gives the lower (upper) limit point of the profile likelihood-based confidence
interval. This problem is rarely explicitly solvable and requires the use of some
kind of iteration (Uusipaikka 1996, Virtanen and Uusipaikka 2008).
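
To make the constrained formulation (5.8)-(5.9) concrete, the sketch below handles a two-parameter exponential model by brute force: it traces the contour on which the log-likelihood equals the constraint level and takes the smallest and largest value of the interest function over the traced points. This is only an illustration under stated assumptions. It is not the iteration of Uusipaikka (1996) or Virtanen and Uusipaikka (2008), it works as written only for two parameters, all names are hypothetical, and the group summaries (9 events with total time 359, 21 events with total time 182) are assumptions chosen to match the estimates in Table 3. The contour is traced in log-mean coordinates, where the log-likelihood is concave, so every ray from the maximum likelihood estimate crosses the contour exactly once.

```python
import math

CHI2_95_1 = 3.841458820694124  # 0.95-quantile of the chi-square distribution with 1 df

# Assumed (events, total time) per group; hypothetical, chosen to match Table 3.
GROUPS = [(9, 359.0), (21, 182.0)]


def loglik(theta):
    """Log-likelihood in log-mean coordinates theta_i = log(mu_i)."""
    return sum(-d * t - total * math.exp(-t) for (d, total), t in zip(GROUPS, theta))


THETA_HAT = [math.log(total / d) for d, total in GROUPS]
CUTOFF = loglik(THETA_HAT) - CHI2_95_1 / 2  # constraint level in (5.9)


def boundary_point(angle):
    """Return (mu_1, mu_2) where the ray from the MLE at `angle` meets l = CUTOFF."""
    direction = (math.cos(angle), math.sin(angle))

    def point(radius):
        return [t + radius * u for t, u in zip(THETA_HAT, direction)]

    inside, outside = 0.0, 1.0
    while loglik(point(outside)) > CUTOFF:  # expand until the ray leaves the region
        outside *= 2.0
    for _ in range(60):  # bisection on the radius
        middle = 0.5 * (inside + outside)
        if loglik(point(middle)) > CUTOFF:
            inside = middle
        else:
            outside = middle
    return tuple(math.exp(t) for t in point(0.5 * (inside + outside)))


BOUNDARY = [boundary_point(2.0 * math.pi * k / 4000) for k in range(4000)]


def profile_limits(g):
    """Approximate (5.8)-(5.9): min and max of g over the traced boundary."""
    values = [g(mu1, mu2) for mu1, mu2 in BOUNDARY]
    return min(values), max(values)


ratio_limits = profile_limits(lambda mu1, mu2: mu1 / mu2)
auc_limits = profile_limits(lambda mu1, mu2: mu1 / (mu1 + mu2))
print(ratio_limits, auc_limits)
```

Because the same traced boundary serves every interest function, adding another smooth function of the means costs only one more pass over the stored points. Under the stated assumptions the limits for the ratio and for the AUC-type function should be close to the corresponding rows of Table 3.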
5.3. Modifications to obtain better coverage probabilities. There exist
various ways to modify likelihood quantities so that the confidence intervals
obtained from these modifications have better coverage properties. The Bartlett
correction (Bartlett 1937, 1938; Box 1946) modifies the likelihood
ratio statistic, Lugannani and Rice (1980) modify the signed root deviance,
that is, the signed square root of the likelihood ratio statistic, and many
authors have suggested various modifications of the profile likelihood function.
References to articles discussing these can be found, for example, in
Barndorff-Nielsen and Cox (1994) and Severini (2000).

All these modifications have been used in such a way that, even in the case
of a one-dimensional parameter, the obtained intervals are not likelihood
intervals. Because likelihood is the primary concept and coverage probability
a secondary one, this is unfortunate. It should be possible to use these
modifications so that the obtained intervals would still be likelihood intervals,
but with better repeated sampling properties.

5.4. Example. Assume that the times of remission can be considered as
samples from two exponential distributions. The statistical model under these
assumptions is

    $\mathcal{M}$ = IndependenceModel[{$\mathcal{M}_1$, $\mathcal{M}_2$}],

where

    $\mathcal{M}_1$ = SamplingModel[ExponentialModel[$\mu_1$], 21]

and

    $\mathcal{M}_2$ = SamplingModel[ExponentialModel[$\mu_2$], 21]

are models for remission times of the Drug and Control groups, respectively.

Of interest in addition to the model parameters might be the difference of
means or their ratio, that is, the parameter functions

    $\psi_1 = \mu_1 - \mu_2$

and

    $\psi_2 = \mu_1/\mu_2$,

but a better characteristic for describing the difference of the distributions might
be the probability that a random observation from the first distribution would
be greater than a statistically independent random observation from the second
distribution. This interest function is related to the so-called Receiver Operating
Characteristic (ROC) curve, namely, it is the area under this curve (AUC).
Under the current assumptions, this interest function has the form

    $\psi_3 = \mu_1/(\mu_1 + \mu_2)$.
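
The closed form of this probability can be checked directly. In the short derivation below, $Y_1$ and $Y_2$ are labels introduced here (they do not appear in the text) for independent exponential observations with means $\mu_1$ and $\mu_2$, so that $P(Y_1 > y) = e^{-y/\mu_1}$ and $Y_2$ has density $\mu_2^{-1} e^{-y/\mu_2}$:

```latex
\begin{aligned}
P(Y_1 > Y_2)
  &= \int_0^\infty P(Y_1 > y)\, \frac{1}{\mu_2} e^{-y/\mu_2}\, dy
   = \frac{1}{\mu_2} \int_0^\infty e^{-y\,(1/\mu_1 + 1/\mu_2)}\, dy \\
  &= \frac{1/\mu_2}{1/\mu_1 + 1/\mu_2}
   = \frac{\mu_1}{\mu_1 + \mu_2}.
\end{aligned}
```

The result depends on the means only through their ratio, since $\mu_1/(\mu_1+\mu_2) = (\mu_1/\mu_2)/(1 + \mu_1/\mu_2)$.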
Table 3
Estimates and approximate 0.95-level confidence intervals of interest functions.

Parameter function   Estimate   Lower limit   Upper limit
$\mu_1$                39.889        22.127        83.013
$\mu_2$                 8.667         5.814        13.735
$\psi_1$               31.222        12.784        74.444
$\psi_2$                4.603         2.174        10.583
$\psi_3$                0.822         0.685         0.914
$\psi_4$               -2.820        -8.677        -0.634

Fig 6. Plot of log-likelihood of $\psi_2$.
Another parameter function describing the difference of the distributions is
the so-called Kullback-Leibler divergence, which under the current assumptions
is

    $\psi_4 = 2 - \mu_2/\mu_1 - \mu_1/\mu_2$.

Table 3 gives maximum likelihood estimates and approximate 0.95-level profile
likelihood-based confidence intervals for parameters and interest functions.
Figures 6 and 7 show graphs of log-likelihoods of $\psi_2$ and $\psi_3$, respectively.
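
Since maximum likelihood estimates are invariant under reparametrization, the Estimate column of Table 3 can be reproduced by evaluating each interest function at the estimated means. The sketch below does this arithmetic under two assumptions that are not stated in this section: the group summaries (9 events with total time 359, 21 events with total time 182) are chosen to match the tabulated means, and the divergence-type function is taken in the form $2 - \mu_1/\mu_2 - \mu_2/\mu_1$, which reproduces the tabulated value -2.820. The dictionary keys are hypothetical labels.

```python
# Assumed summaries (hypothetical): events and total time per group.
mu1 = 359.0 / 9    # estimated mean, Drug group
mu2 = 182.0 / 21   # estimated mean, Control group

# Interest functions evaluated at the maximum likelihood estimates.
estimates = {
    "mu1": mu1,
    "mu2": mu2,
    "difference": mu1 - mu2,
    "ratio": mu1 / mu2,
    "auc": mu1 / (mu1 + mu2),
    "divergence": 2 - mu1 / mu2 - mu2 / mu1,  # assumed form, see lead-in
}

for name, value in estimates.items():
    print(f"{name:>10s} {value:8.3f}")
```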

Fig 7. Plot of log-likelihood of $\psi_3$ (standardized log-likelihood against the interest function).

6. Conclusions. In this article a possible, and in the author's mind highly plausible
and well justified, interpretation of what Fisher meant by the word 'estimate'
has been given. According to this interpretation an estimate is a statistic
taking values in the parameter space such that the observed likelihood function
obtained from the sampling distribution of this estimate is used to
construct likelihood intervals for the parameter. Fisher's Theory of Statistical
Estimation is a chain of ideas which Fisher developed to justify his
solution to the problem of estimation. This solution is the Method of Maximum
Likelihood, in which likelihood intervals are produced from the observed likelihood
function obtained from the (conditional) sampling distribution of the
maximum likelihood estimate. As shown above, the end result of Fisher's justification
is that likelihood intervals must be based on the original likelihood
function obtained from statistical evidence so that all relevant information
will be used.

Calculating MLE-based confidence intervals from the asymptotic normal
distribution of the maximum likelihood estimate using the delta method is the
currently dominant approach. This approach was shown to involve Fisher's
ideas but to be against his ideology, because MLE-based confidence intervals
are generally not likelihood intervals.
In addition these intervals are not invariant, may contain impossible values,
and have poor coverage probabilities.
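
The lack of invariance can be seen in a small numerical sketch for a single exponential mean. The summaries used (9 events, total time 359) are assumptions chosen to match the estimate 39.889 of the first mean in Table 3, and the variable names are hypothetical. The code computes the normal-approximation interval once on the scale of the mean and once on the scale of its logarithm, mapped back to the mean.

```python
import math

Z_975 = 1.959963984540054  # 0.975-quantile of the standard normal distribution

events, total_time = 9, 359.0  # assumed summaries (hypothetical, see lead-in)
mu_hat = total_time / events

# Interval on the original scale: the standard error of mu_hat is mu_hat / sqrt(events).
se_mu = mu_hat / math.sqrt(events)
wald_mu = (mu_hat - Z_975 * se_mu, mu_hat + Z_975 * se_mu)

# Interval for log(mu) mapped back to mu: by the delta method the standard
# error of log(mu_hat) is 1 / sqrt(events).
se_log = 1.0 / math.sqrt(events)
wald_log = (mu_hat * math.exp(-Z_975 * se_log), mu_hat * math.exp(Z_975 * se_log))

print(wald_mu, wald_log)
```

The two intervals differ although they describe the same parameter, and neither coincides with the profile likelihood-based interval (22.127, 83.013) given for this mean in Table 3. A likelihood-based interval is unchanged by such a reparametrization.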

Fisher did not have a systematic solution to the problem of estimation
when a real-valued parameter function of a higher-dimensional parameter is
of interest. During the last fifty years, various solutions based on pseudo
likelihood functions have been suggested. The one that has generated the most
research and applications is likelihood-based inference using the profile likelihood
function. It was shown that this gives a general solution to Fisher's problem
of estimation.

A new method of calculation of profile likelihood-based intervals was considered.
This method is simple and powerful, and it can be used for general
smooth interest functions in general statistical models. Therefore there
is no theoretical or practical reason to calculate MLE-based confidence
intervals instead of profile likelihood-based intervals.

In conclusion, even though it seems to be a very common opinion, for Fisher
the method of maximum likelihood did not mean the use of the MLE as a point
estimate or the use of the asymptotic normal distribution of the MLE to construct
confidence intervals. Instead it meant the construction of likelihood intervals
or regions from the original likelihood function or some pseudo likelihood
function. So interpreted, Fisher's method of maximum likelihood is very
much like the method of support discussed by A. W. F. Edwards in his book
Likelihood (1972).